RESEARCH REPORT

Bootstrapping Development of a Cloud-Based Spoken Dialog 
System in the Educational Domain From Scratch Using 
Crowdsourced Data 

Vikram Ramanarayanan,1 David Suendermann-Oeft,1 Patrick Lange,1 Alexei V. Ivanov,1
Keelan Evanini,2 Zhou Yu,3 Eugene Tsuprun,2 & Yao Qian1

1 Educational Testing Service, San Francisco, CA 

2 Educational Testing Service, Princeton, NJ 

3 Carnegie Mellon University, Pittsburgh, PA 


We propose a crowdsourcing-based framework to iteratively and rapidly bootstrap a dialog system from scratch for a new domain. We 
leverage the open-source modular HALEF dialog system to deploy dialog applications. We illustrate the usefulness of this framework 
using four different prototype dialog items with applications in the educational domain and present initial results and insights from 
this endeavor. 

Keywords Spoken dialog systems; crowdsourcing; computer-assisted language learning; automated assessment 
doi:10.1002/ets2.12105 


Spoken dialog systems (SDSs) consist of multiple subsystems, such as automatic speech recognizers (ASRs), spoken language
understanding (SLU) modules, dialog managers (DMs), and spoken language generators, among others, interacting
synergistically and often in real time. Each of these subsystems is complex and brings with it design challenges and open 
research questions in its own right. Rapidly bootstrapping a complete, working dialog system from scratch is therefore a 
challenge of considerable magnitude. Apart from the issues involved in training reasonably accurate models for ASR and 
SLU that work well in the domain of operation in real time, one has to ensure that the individual systems also work well in 
sequence such that the overall SDS performance does not suffer and guarantees an effective interaction with interlocutors 
who call into the system. 

The ability to rapidly prototype and develop such SDSs is important for applications in the educational domain. For 
example, in automated conversational assessment, test developers might design several conversational items, each in a 
slightly different domain or subject area. One must, in such situations, be able to rapidly develop models and capabilities 
to ensure that the SDS can handle each of these diverse conversational applications gracefully. This is also true in the case 
of learning applications and so-called formative assessments: One must be able to quickly and accurately bootstrap SDSs 
that can respond to a wide variety of learner inputs across domains and contexts. Language learning and assessments 
add yet another complication in that systems need to deal gracefully with nonnative speech. Despite these challenges, the 
increasing demand for nonnative conversational learning and assessment applications makes this avenue of research an 
important one to pursue; however, this requires us to find a way to rapidly obtain data for model building and refinement 
in an iterative cycle. 

Crowdsourcing is one solution that allows us to overcome this obstacle of obtaining data rapidly for iterative model 
building and refinement. Crowdsourcing has been used to rapidly and cheaply obtain data for a number of spoken language
applications in recent years, such as native (Suendermann-Oeft, Liscombe, & Pieraccini, 2010) and nonnative
(Evanini, Higgins, & Zechner, 2010) speech transcription and evaluation of the quality of speech synthesizers (Buchholz
& Latorre, 2011; Wolters, Isaac, & Renals, 2010). Crowdsourcing, and particularly Amazon Mechanical Turk, has also
been used for assessing SDSs and for collecting interactions with SDSs. In particular, McGraw, Lee, Hetherington, Seneff, 
and Glass (2010) evaluated an SDS with a multimodal web interface using MIT’s WAMI toolkit, collecting more than 
1,000 dialog sessions. Rayner, Frank, Chua, Tsourakis, and Bouillon (2011) tested a computer-assisted language learning
application with spoken input, collecting a total of 129 interactions. Jurčíček et al. (2011) deployed a phone-based SDS for the
restaurant information domain, collecting 923 calls. However, to our knowledge, crowdsourcing has not previously been applied to
the iterative development of an SDS (and its components), particularly in the educational domain.

Figure 1 Proposed crowdsourcing-based iterative bootstrapping setup for rapid spoken dialog system development.

Therefore the goal of this report is to propose an iterative framework wherein a spoken (and potentially multimodal) 
dialog system can be architected from very generic models to more domain-specific models in a continuous development 
cycle and to present the initial results of an ongoing successful deployment of such a bootstrapped dialog system that is 
tailored to educational domain applications. 


The HALEF Dialog Ecosystem 

We use the open-source HALEF dialog system 1 to develop conversational applications within the crowdsourcing framework.
Please see Figure 1 for a schematic overview of this framework. Because the HALEF architecture and components
have been described in detail in prior publications (Ramanarayanan, Suendermann-Oeft, Ivanov, & Evanini, 2015;
Suendermann-Oeft, Ramanarayanan, Teckenbrock, Neutatz, & Schmidt, 2015), we only briefly mention the various modules
of the system here:

• Telephony servers Asterisk (van Meggelen, Smith, & Madsen, 2009) and FreeSWITCH (Minessale, Schreiber,
Collins, & Chandler, 2012), which are compatible with Session Initiation Protocol (SIP), Public Switched Telephone
Network (PSTN), and web Real-Time Communications (WebRTC) standards and include support for voice
and video

• A voice browser, JVoiceXML (Schnelle-Walka, Radomski, & Mühlhäuser, 2013), which is compatible with
VoiceXML 2.1 and can process SIP traffic and which incorporates support for multiple grammar standards, such
as Java Speech Grammar Format (JSGF), Advanced Research Projects Agency (ARPA), and Weighted Finite State
Transducer (WFST)

• A Media Resource Control Protocol (MRCP) speech server (Prylipko, Schnelle-Walka, Lord, & Wendemuth,
2011), Cairo, which allows the voice browser to initiate SIP or Real-Time Transport Protocol (RTP) connections
from/to the telephony server and incorporates two speech recognizers (Sphinx and Kaldi; see respectively Lamere
et al., 2003; Povey et al., 2011) and synthesizers (Mary and Festival; see respectively Schröder & Trouvain, 2003;
Taylor, Black, & Caley, 1998)

Figure 2 Average age of Turkers who called in to the spoken dialog system using Amazon Mechanical Turk. The bar on top represents
the mean ± 1 standard deviation.

• An Apache Tomcat-based web server, 2 which can host dynamic VoiceXML pages, web services, and media libraries 
containing grammars and audio files 

• OpenVXML, 3 a VoiceXML-based voice application authoring suite that generates dynamic web applications that can
be housed on the web server

• A MySQL 4 database server for storing call logs 

• A speech transcription, annotation, and rating portal that allows one to listen to and transcribe full-call recordings, 
rate them on a variety of dimensions such as caller experience and latency, and perform various semantic annotation 
tasks required to train ASR and SLU modules 

Because we are bootstrapping a dialog system from scratch, we used generic models for ASR (trained on the Wall Street 
Journal corpus) and rule-based models (defined as part of the dialog flow in VXML, using the OpenVXML software) for 
SLU and DM. Although we plan to use the data flowing in continuously to iteratively refine statistical models for ASR, 
SLU, and DM in the future, for the purposes of this report, we focus only on the initial models. 

Crowdsourcing Data Collection 

We used Amazon Mechanical Turk for our crowdsourcing data collection experiments. Each spoken dialog task was its 
own individual human intelligence task (called a HIT on Amazon Mechanical Turk). In addition to reading instructions 
and calling in to the system, users were requested to fill out a 2- to 3-minute survey regarding the interaction. There were 
no particular restrictions on who could do the spoken dialog task, as we did not want to constrain the pool of people 
calling in to the system initially. As we continue to develop better models, we plan to restrict this pool of speakers to 
nonnative speakers of English. For the initial study that we report here, however, participants were mostly native speakers 
of American English (there were only 14 nonnative speakers) hailing from all over the continental United States; 43% 
were male, whereas 57% were female, and participants were well distributed across age groups (see Figure 2). In all, we 
collected 676 production calls over approximately 1 month of data collection, amounting to approximately 23 hours of 
speech. 

Spoken Dialog Tasks 

We deployed four conversational tasks for the purposes of this experiment: two tasks that tested pragmatic appropriateness 
of responses spoken during common scenarios in the workplace, a job interview, and a pizza-order-taking scenario. 

One workplace pragmatics item involved interactive schedule negotiation, requiring several exchanges of information. 
The caller's task was to study and comprehend a weekly meeting schedule (provided as stimulus material) and then respond
appropriately to an automated coworker who was trying to schedule a lunch meeting with the caller. The caller then dialed
in to the system and proceeded to answer the sequence of questions posed by the automated coworker. Depending on the
semantic class of the caller's answer to each question (as determined by the output of the speech recognizer and the
natural language understanding module), he or she was redirected to the appropriate branch of the dialog tree, and the
conversation continued until all such questions were answered. This item was designed to measure the caller's ability to
(a) understand the visual and oral stimuli (a work schedule and the automated coworker's questions and responses) and
(b) politely and appropriately accept and decline invitations.

Figure 3 Distributions of call handling times (call durations) for each of the different items deployed.
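
The branching described for the schedule negotiation item can be illustrated with a minimal sketch. It assumes a keyword-based stand-in for the rule-based SLU and a small hand-written dialog tree; the state names, prompts, and keyword lists are hypothetical and are not taken from the deployed call flow, which was defined in VoiceXML using OpenVXML rather than in Python.

```python
import re

# Hypothetical keyword rules standing in for the rule-based SLU step.
SEMANTIC_RULES = {
    "accept": ("yes", "sure", "sounds good", "works for me"),
    "decline": ("no", "sorry", "busy", "can't"),
}

# Hypothetical dialog tree: each state has a prompt and one branch
# per semantic class. States with no branches are terminal.
DIALOG_TREE = {
    "ask_friday": {
        "prompt": "Are you free for lunch on Friday?",
        "branches": {"accept": "confirm", "decline": "ask_monday"},
    },
    "ask_monday": {
        "prompt": "How about Monday instead?",
        "branches": {"accept": "confirm", "decline": "give_up"},
    },
    "confirm": {"prompt": "Great, see you then!", "branches": {}},
    "give_up": {"prompt": "OK, maybe another week.", "branches": {}},
}


def classify(utterance):
    """Map a recognized utterance to a semantic class, or 'no_match'."""
    text = utterance.lower()
    for label, keywords in SEMANTIC_RULES.items():
        for keyword in keywords:
            # Match whole words or phrases only, so "no" does not fire on "noon".
            if re.search(r"\b" + re.escape(keyword) + r"\b", text):
                return label
    return "no_match"


def run_dialog(recognized_utterances, start="ask_friday"):
    """Follow the tree, consuming one recognized utterance per system question.

    Returns the list of states visited, including repeats caused by re-prompts.
    """
    state = start
    visited = [state]
    for utterance in recognized_utterances:
        branches = DIALOG_TREE[state]["branches"]
        if not branches:  # terminal state reached; ignore further input
            break
        label = classify(utterance)
        # On an unrecognized answer, stay in the same state (re-prompt).
        state = branches.get(label, state)
        visited.append(state)
    return visited


if __name__ == "__main__":
    path = run_dialog(["Sorry, I'm busy then", "Sure, Monday works for me"])
    for state in path:
        print(state, "->", DIALOG_TREE[state]["prompt"])
```

In this sketch the system keeps the initiative throughout: it asks, classifies the answer, and selects the next question, which is the pattern shared by the three system-initiated items.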

Another similar task tested how pragmatically appropriate callers’ responses were when accepting or declining
an offer of food in the workplace. Yet another task provided callers/test takers with a sample resume as a stimulus and asked
them to act as a job seeker in an interview with an automated interviewer. Please see Ramanarayanan et al. (2015) for more detailed
call-flow schematics corresponding to these tasks. 

Whereas the three aforementioned tasks were system-initiated dialog scenarios, the fourth involved user-driven dialog. 
In this task, callers were required to act as customer service representatives at a pizza restaurant and to take an order from 
an automated customer who wanted to order a pizza. In the scenario, the automated customer waited for the user to ask a 
question (“what is your name?” “what toppings would you like on your pizza?” etc.) before replying with the appropriate 
response. Therefore this task might have been harder than the other three, imposing more cognitive load on the user. 
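
To make the contrast with the system-initiated items concrete, the following minimal sketch shows a user-driven exchange in which the automated customer only reacts to the caller's questions. The question classes, regular expressions, and canned replies are hypothetical and serve only to illustrate the control flow; they are not the prompts or grammars used in the deployed pizza item.

```python
import re

# Hypothetical question classes: a pattern that detects the question
# and the automated customer's canned reply.
CUSTOMER_REPLIES = {
    "ask_name": (r"\bname\b", "My name is Alex."),
    "ask_toppings": (r"\btoppings?\b", "Mushrooms and olives, please."),
    "ask_size": (r"\bsize\b", "A large, please."),
}

FALLBACK_REPLY = "Sorry, could you repeat the question?"


def customer_turn(question):
    """Return (question class, reply) for one caller question."""
    text = question.lower()
    for label, (pattern, reply) in CUSTOMER_REPLIES.items():
        if re.search(pattern, text):
            return label, reply
    return "no_match", FALLBACK_REPLY


def take_order(questions):
    """Return the question classes the caller has not yet covered.

    An empty result means the caller elicited every piece of information.
    """
    remaining = set(CUSTOMER_REPLIES)
    for question in questions:
        label, _reply = customer_turn(question)
        remaining.discard(label)
    return remaining


if __name__ == "__main__":
    asked = ["What is your name?", "What toppings would you like on your pizza?"]
    for question in asked:
        print(question, "->", customer_turn(question)[1])
    print("Still to ask:", sorted(take_order(asked)))
```

Here the order of the conversation is decided by the caller, so the task is complete only once the caller has thought to ask every required question, which is consistent with the higher load on the user suggested above.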

Qualitative and Quantitative Performance Analysis 

Figure 3 and Table 1 depict the distributions of call durations (or call handling times) and call completion rates, respectively,
for each of the four items deployed. The pragmatics items were much shorter than the caller-initiated pizza item or
the interview item and, as might be expected, had higher completion rates. The difference in duration was possibly because
(a) there were more dialog states in the latter two items as compared to the first two or (b) the relatively more open-ended nature of
the interview questions elicited longer and more spontaneous responses from callers. The longer items with more dialog
states were also more likely to run into system issues at this initial stage of deployment and therefore had lower call completion
rates. However, as Figure 4 shows, completion rates for the longer items showed a statistically significant improvement over
time (p «.01). This graph in particular shows the usefulness and effectiveness of the iterative development framework,
which allowed us to find and correct issues with the system (whether they were in the VXML call flows, system code, or
models) and redeploy the system to obtain rapid feedback about the modifications made.
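
As a quick consistency check on Table 1, the per-item figures can be combined into an overall completion rate. The sketch below uses only the call counts and completion rates reported in Table 1; the numbers of completed calls are not reported directly and are inferred here by rounding, which is an assumption.

```python
# Values as reported in Table 1: (item, number of calls, completion rate in %).
TABLE_1 = [
    ("Pragmatics (food offer)", 131, 61.83),
    ("Pragmatics (scheduling)", 166, 66.87),
    ("Job interview", 192, 35.42),
    ("Pizza customer service", 187, 47.06),
]


def implied_completed(calls, rate_percent):
    """Infer the number of completed calls from a rounded percentage (assumption)."""
    return round(calls * rate_percent / 100)


def pooled_rate(rows):
    """Completion rate (%) over all calls in the given rows."""
    calls = sum(n for _, n, _ in rows)
    completed = sum(implied_completed(n, r) for _, n, r in rows)
    return 100 * completed / calls


total_calls = sum(n for _, n, _ in TABLE_1)
total_completed = sum(implied_completed(n, r) for _, n, r in TABLE_1)
overall_rate = pooled_rate(TABLE_1)
short_items_rate = pooled_rate(TABLE_1[:2])  # the two pragmatics items
long_items_rate = pooled_rate(TABLE_1[2:])   # interview and pizza items

if __name__ == "__main__":
    print(total_calls, total_completed, round(overall_rate, 2))
    print(round(short_items_rate, 2), round(long_items_rate, 2))
```

The call counts sum to 676, matching the total number of production calls reported above. Under the rounding assumption, 348 of these calls were completed, for a pooled completion rate of about 51.5% (roughly 64.6% for the two pragmatics items and 41.2% for the two longer items).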

To better understand how the system performed when actual test takers call in, we asked all Turkers to rate various 
aspects of their interaction with the system on a 5-point scale ranging from 1 (least satisfactory) to 5 (most satisfactory). 
The results of this user evaluation are depicted in Figure 5. We further had expert reviewers listen to each of the full-call 
recordings, examine the call logs, and rate each call on a range of dimensions (Suendermann-Oeft, Liscombe, Pieraccini,
& Evanini, 2010). Histograms of these ratings are also shown in Figure 5 and include the following:

• Audio quality of system responses. This metric measured, on a scale from 1 to 5, how clear the automated agent was.
Poor audio quality would be marked by frequent dropping in and out of the automated agent's voice or by muffled
or garbled audio.

Table 1 Completion Rate for Each of the Four Items Deployed

Item                      No. dialog states   No. calls   Completion rate (%)
Pragmatics (food offer)   1                   131         61.83
Pragmatics (scheduling)   3                   166         66.87
Job interview             8                   192         35.42
Pizza customer service    7                   187         47.06



Figure 4 Completion rates over time for the two longer items deployed (interview and pizza), depicted by filled circles. The days are 
in chronological order but not necessarily consecutive. Note the increasing trend, depicted in red. The linear regression slope was 
significant at the 95% level (p «.01), and a left-sided Wilcoxon rank sum test showed that the completion rates after the ninth day were 
significantly higher than those on or before (p « .001). 


• Qualitative latency score. A score measuring how debilitating to the conversation the average delay was between
the time the user finished speaking and the automated agent's response.

• Caller experience. A qualitative measure of the caller's experience using the automated agent, with 1 for a very bad
experience and 5 for a very good experience.

• Caller cooperation. A qualitative measure of the caller's cooperation, or the caller's willingness to interact with the
automated agent, with 1 for no cooperation and 5 for fully cooperative.

We observed that users gave high median ratings for the extent to which they were able to complete their calls (4) as
well as for the intelligibility of the system audio (5). Overall, users felt that the system performed well, with a median
self-rated caller experience rating of 4. Experts tended to agree with user ratings in these cases, with similar median ratings for
caller experience and audio quality. However, there was still plenty of scope for improvement with respect to how easy it
was to understand the system prompts and how appropriate they were, with a median rating of 3. The median user rating
of 3 (“satisfactory”) for the system understanding category is not surprising, given that we are using unsophisticated
rule-based grammars and natural language understanding. Another interesting observation is that users tended to find the
system latency more debilitating on average than experts did on listening to full-call recordings (median rating of 3 vs. 4).
We will continue to investigate this going forward as more calls come in and system enhancements are made.

Conclusions and Outlook 

We have presented a crowdsourcing framework that allows rapid prototyping and iterative development of dialog system
components and models. Such a framework allows one to iteratively improve the system over time, as seen by the
improvement of call completion rates over time in our case (Figure 4). This is because the framework enables developers
to rapidly identify and resolve issues in call flows and in the source code of various components and to make modeling
enhancements, such as the addition of semantic classes or the modification of synthesis voices. The exciting aspect of having
a continuous influx of data and system feedback is that this opens up many avenues for ongoing and future research,
including, but not limited to, better statistical models for ASR, SLU, and DM; iterative improvements to the item design;
and parallelization and other code enhancements to improve system robustness and efficiency.

Figure 5 (top) User ratings, (bottom) Expert ratings.
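
The kind of trend analysis reported for Figure 4 (a linear regression slope and a left-sided Wilcoxon rank sum test comparing earlier with later days) can be sketched as follows. The daily completion rates below are made-up illustrative values, not the data behind Figure 4, and the p-value uses a normal approximation without tie correction, which is an assumption of this sketch rather than the procedure used in the report.

```python
import math


def ols_slope(values):
    """Least-squares slope of values against the day index 1..n."""
    n = len(values)
    days = range(1, n + 1)
    day_mean = sum(days) / n
    value_mean = sum(values) / n
    sxy = sum((d - day_mean) * (v - value_mean) for d, v in zip(days, values))
    sxx = sum((d - day_mean) ** 2 for d in days)
    return sxy / sxx


def rank_sum_left(early, late):
    """Left-sided rank sum test: do 'early' values tend to be lower than 'late'?

    Returns (U statistic of the early group, approximate one-sided p-value).
    Ties receive average ranks; the variance is not tie-corrected.
    """
    pooled = sorted(early + late)
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Positions i+1 .. j (1-based) share the same value: use their mean rank.
        ranks[pooled[i]] = (i + 1 + j) / 2
        i = j
    n1, n2 = len(early), len(late)
    rank_sum_early = sum(ranks[v] for v in early)
    u_early = rank_sum_early - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_early - mean_u) / sd_u
    p_value = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # P(Z <= z)
    return u_early, p_value


# Hypothetical daily completion rates (%) for illustration only.
EARLY_DAYS = [20.0, 25.0, 30.0, 28.0, 35.0, 22.0, 31.0, 27.0, 33.0]  # days 1-9
LATE_DAYS = [45.0, 50.0, 48.0, 55.0, 52.0, 60.0]                      # after day 9

if __name__ == "__main__":
    print("slope per day:", ols_slope(EARLY_DAYS + LATE_DAYS))
    print("U, p:", rank_sum_left(EARLY_DAYS, LATE_DAYS))
```

In practice a library routine such as `scipy.stats.mannwhitneyu` with `alternative="less"` would normally be preferred over a hand-written approximation; the pure-Python version is shown only to keep the sketch self-contained.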
Notes

1 https://sourceforge.net/p/halef/ 

2 http://tomcat.apache.org/ 

3 https://github.com/OpenMethods/OpenVXML/ 

4 https://www.mysql.com/ 